CopyVAE: a variational autoencoder-based approach for copy number variation inference using single-cell transcriptomics

Abstract Motivation Copy number variations (CNVs) are common genetic alterations in tumour cells. The delineation of CNVs holds promise for enhancing our comprehension of cancer progression. Moreover, accurate inference of CNVs from single-cell sequencing data is essential for unravelling intratumoral heterogeneity. However, existing inference methods face limitations in resolution and sensitivity. Results To address these challenges, we present CopyVAE, a deep learning framework based on a variational autoencoder architecture. Through experiments, we demonstrated that CopyVAE can accurately and reliably detect CNVs from data obtained using single-cell RNA sequencing. CopyVAE surpasses existing methods in terms of sensitivity and specificity. We also discussed CopyVAE’s potential to advance our understanding of genetic alterations and their impact on disease advancement. Availability and implementation CopyVAE is implemented and freely available under MIT license at https://github.com/kurtsemih/copyVAE.

Supplementary for CopyVAE: a variational autoencoder-based approach for copy number variation inference using single-cell transcriptomics 1 Methods

Identification of diploid cells
The model is applied to count matrix X and latent representations for each cell are obtained.These representations are clustered via k-means clustering method.For each cluster, the auto-correlation score is computed as below: where ρ m is the auto-correlation score, µ m is the mean expression profile, and σ 2 m is the variance of cluster m. x i and x j denote the expression profiles of the cells i and j in cluster m.
Based on the assumption that normal cells exhibit lesser diversity in contrast to tumour cells (R Gao et al., 2021), the cluster with the greatest autocorrelation is identified as the diploid cells.
The baseline expression level of each bin is computed as the mean expression level of that bin among the identified diploid cells.

The CopyVAE model
The expression count x nb of each bin b within each cell n undergoes a normalization process by referencing the bin b's specific baseline expression level.The normalized expression count then becomes the preliminary estimate of the copy number cnb of each bin b within each cell n.This preliminary estimate is referred to as the pseudo copy number and modelled as a sample drawn from a negative binomial (NB) distribution.Through a neural network, the unobserved latent random variable z n associated with each cell n is mapped to the parameters characterizing this NB distribution, utilizing the actual copy number estimate c nb for each bin b within each cell n.

1
The generative process for CopyVAE is described as follows: A standard multivariate normal prior is employed for the latent variable z n , as in vanilla variational autoencoders (Kingma & Welling, 2013): The latent variable z n is mapped to the unobserved copy number c nb via a neural network f c (NN2 in Figure 1 of the main document): The latent variable z n is also mapped to the dispersion parameter σ nb of each bin b in each cell n, via another neural network f d (NN3 in Figure 1 of the main document): Finally, the pseudo copy number cnb is modelled as a sample drawn from a NB distribution with the parameters c nb and σ nb : cnb ∼ NB c nb , σ nb Both neural networks f c and f d consist of two fully connected layers and they share the weights for the first layer.A rectified linear unit (ReLU) activation function is utilized between the layers.All the networks employ dropout regularization and batch normalization.The output of the function f c is passed through a sigmoid activation function and then scaled by a hyperparameter known as the maximum copy number.This scaling ensures that the copy number estimate remains within an upper limit.The output of f d is subjected to the softplus activation function to assure the non-negativity of the inferred dispersion.
The posterior distribution of the latent variable p(z n |x n ) is approximated by training another neural network(NN1 in Figure 1 of the main document) using variational inference, similar to Kingma & Welling, (2013) and Lopez et al. (2018).The variational distribution q(z n |x n ) is chosen to be multivariate Gaussian with a diagonal covariance matrix.The mean and covariance of this distribution are obtained by applying a neural network to x n .This network consists of one common fully connected layer and two separate additional layers: one for mean and one for covariance.A ReLU activation function is utilized between the layers.The common layer employs dropout regularization and batch normalization.The non-negativity of the inferred covariance is ensured by the exponentiation of the output of the covariance network.
The entire model is trained to optimize the evidence lower bound(ELBO): The first term on the right-hand side of the equation can be computed using the analytic expression for p(x|z), the reparameterization trick, and lowvariance Monte Carlo estimate of the expectation.The second term, on the other hand, can be computed using the closed-form analytic expression for the Kullback-Leibler divergence, which is possible while both the variational and the prior distributions are Gaussian distributions.
The parameters of the entire model are updated using an Adam optimizer (Kingma & Ba, 2014) with the learning rate of 1e-3 and the epsilon of 1e-2.Supplementary Table 1 provides the complete list of the hyperparameters involved in the model.The selection of the number of layers and the number of hidden neurons were motivated by the other recent deep learning models on scRNA-seq data [4,1,3,5].The hyperparameters of the Adam optimizer, namely the learning rate and the epsilon were chosen by a grid search so that the held-out log-likelihood is maximized, which is a generalizable approach.Latent dimension was shown to not have a credible effect on the performance of CopyVAE (see Supplementary 3.1.2).

Segmentation and clone profile estimation
For the segmentation process, we used the PELT algorithm, an exact multiple changepoint detection method, as proposed by Killick et al. (2012).The goal of the segmentation is to identify a set of breakpoints τ 1:m = (τ 1 , . . ., τ m ) in a given sequence of data points y 1:n = (y 1 , . . ., y n ) with an assumption τ 0 = 0 and τ m+1 = n, that minimises the total loss, as defined below: where L is the loss function and β is a penalty constant to prevent overfitting.β is selected to be proportional to the product of the clone size and the logarithm of the number of positions.The loss function L for the segment between 2 breakpoints τ a and τ b is defined as the negative maximum log-likelihood function: The observed data at a given location i, denoted by the variable y i , is the set of copy numbers within a clone of size k at bin position i: where c υi is the copy number of bin i in cell υ.Assuming independence and identical distribution of cell copy numbers at a particular position, we can compute the likelihood of y i , P (y i | θ), as follows: The PELT algorithm combines a recursive optimization method (Jackson et al., 2005) with a pruning technique, resulting in a computational cost for optimization that scales linearly with the number of cells.After determining the breakpoints for each tumour clone individually, single-cell copy numbers are obtained by computing the copy number of each segment between two breakpoints as the median value of the bin copy numbers within that segment.The single-cell copy number profiles are then aggregated by calculating the mean value of the segment copy numbers, resulting in the clone copy number profile.

Detection of loss of heterozygosity (LoH) regions
We utilise single-nucleotide polymorphism (SNP) variant allele frequency to detect loss of heterozygosity (LoH) regions.Because we assume that bulk DNA data is not necessarily available, we only use the scRNA-seq data.The process begins with selecting the SNPs with at least 5 percent population allele frequency in the 1000 Genomes project.Then for each of these SNPs the number of alternative and reference read counts in each cell is obtained using cell-snplite [2].Then only those SNPs that have an allele frequency between 0.1 and 0.9 in the dataset are kept as heterozygous SNPs.After running CopyVAE, we exclude the normal cells and then if any bin has an average minor allele frequency less than 0.1 in the tumor cell population and is reported as neutral or deleted by CopyVAE, we call an LOH event for that bin.For the synthetic datasets BM2-v0s0 to BM2-v0s4, we ran CopyVAE with six different bin sizes ranging from 5 to 75.We quantitatively assessed the method's performance on clustering and copy number inference.To assess clustering, for each run we computed the silhouette score as the mean silhouette coefficient of all cells in the latent space.Supplementary Figure 1 shows that the silhouette score is maximum for the bin size of 25 genes.The explanation is that too small bin sizes do not strive enough to reduce the effects of noise and dropouts, whereas too large bin sizes end up averaging out some useful discriminative information in data.To assess copy number inference, we employed the evaluation metrics explained in section 3.1 of the main text.Supplementary Figure 2 shows that the performance of the method decreases slightly with the increased bin size.One should note that the choice of bin size does not depend on the performance, but on the robustness.Even though smaller bin sizes perform better, they may not be robust for the clustering process as evident from their relatively lower silhouette scores.

Latent dimension
For the synthetic datasets BM2-v0s0 to BM2-v0s4, we ran CopyVAE with six different latent dimension ranging from 5 to 50.We quantitatively assessed the method's performance using the evaluation metrics explained in the Section 3.1 of the main text.Supplementary Figure 3 shows that the latent dimension does not have an credible effect on the performance of CopyVAE.

The effect of number of cells
Following the data generation pipeline explained in section 3.2.1 of the main text, we generated datasets consisting of different numbers of cells ranging from 200 to 1000.For each specific number of cells, we generated 5 different datasets (25 datasets in total).We evaluated the CopyVAE's performance on these datasets using the aforementioned evaluation metrics.Supplementary Figure 4 demonstrates that CopyVAE performs equally well for each number of cells.

The effect of tumor proportion
Following the data generation pipeline explained in section 3.2.1 of the main text, we generated datasets consisting of different tumour cell proportions: 0.6, 0.7, 0.8, 0.9.For each specific tumour proportion, we generated 5 different datasets (20 datasets in total).The CopyVAE's performance on these datasets was assessed using the previously mentioned evaluation metrics.Supplementary Figure 5 reveals that CopyVAE performs well for all the datasets, being independent of tumour proportion.

The effect of larger copy numbers
We introduced larger copy numbers up to 32 to the dataset BM2-v0s0, namely on a wide region in chromosome 9 and narrower regions in chromosome 15 (Supplementary Figure 6 -Top).We ran CopyVAE for this dataset keeping the maximum copy number hyperparameter as 6.The result showed that our method is quite robust for such a setting as well (Supplementary Figure 6 -Bottom).The amplified regions with copy numbers greater than 6 are estimated as having the maximum allowed copy number of 6.The other amplified and deleted regions are reconstructed correctly.A widespread low-level deletion artifact across the genome is observed.This artifact can be explained as caused by the extremely large copy numbers in the dataset affecting the baseline inference of CopyVAE.Due to those extremely amplified regions, normal cells' read counts are upscaled across the whole genome during normalization, giving a slightly overestimated baseline, which in turn leads to the aforementioned low-level deletion artifact.Besides these qualitative assessments, we also quantitatively evaluated the resulting profile comparing it to the ground truth.The comparison yielded in the Pearson correlation of 0.7372, the cosine similarity of 0.8964, the Euclidean distance of 0.0131, and the Manhattan distance of 0.6502, producing evidence of the robustness of CopyVAE against extremely large copy numbers.

The effect of correlation among CNVs
We simulated correlated CNVs using CNAsim [6], which is a recent method for an improved generation of single-cell CNVs via simulating an exponentially growing tumor population under neutral coalescence.CNVs generated by CNAsim possess complex clonal correlations.We tested CopyVAE on a synthetic dataset having such correlated CNVs (Supplementary Figure 7 -Top) and the results showed that those correlations do not appreciably affect the performance of CopyVAE (Supplementary Figure 7 -Bottom).We further evaluated the performance of CopyVAE by comparing it against the ground truth profile using the evaluation metrics.The comparison yielded in the Pearson correlation of 0.9079, the cosine similarity of 0.9799, the Euclidean distance of 0.0047, and the Manhattan distance of 0.4079, submitting evidence of the robustness of CopyVAE for correlated CNVs.

The effect of number of clones
We simulated a dataset consisting of 1000 cells and four clones, using CNAsim [6] to generate CNVs.We applied CopyVAE to this dataset.Supplementary Figure 8 shows that CopyVAE successfully detected the normal cells and the different clones.Compared against the ground truth labels, CopyVAE achieved a clustering accuracy of 0.947.CopyVAE performed well for the copy number profile estimation of each clone as well.As seen from Supplementary Figure 9, the profiles estimated by CopyVAE are in line with the ground truth profiles.Supplementary Table 3 shows also quantitatively that CopyVAE exhibits a stable performance among the different clones.

Fig. 2 :
The effect of bin size on CopyVAE's performance on the synthetic datasets BM2-v0s0 to BM2-v0s4, evaluated by the Pearson correlation (top-left), the cosine similarity (top-right), the average Euclidean distance (bottom-left), and the average Manhattan distance (bottom-right).
dist.Manhattan distance vs latent dim.Supp.Fig.The effect of latent dimension on CopyVAE's performance on the synthetic datasets BM2-v0s0 to BM2-v0s4, evaluated by the Pearson correlation (top-left), the cosine similarity (top-right), the average Euclidean distance (bottom-left), and the average Manhattan distance (bottom-right).

Fig. 4 :
The effect of number of cells on CopyVAE's performance, evaluated by the Pearson correlation (top-left), the cosine similarity (top-right), the average Euclidean distance (bottom-left), and the average Manhattan distance (bottom-right).

Fig. 5 :
The effect of tumour proportion on CopyVAE's performance, evaluated by the Pearson correlation (top-left), the cosine similarity (top-right), the average Euclidean distance (bottom-left), and the average Manhattan distance (bottom-right).

Supp. Fig. 9 : 2 Supp. Fig. 10 :
The performance of CopyVAE for data having four clones: a) Ground truth copy number profile of clone 1. b) Clone 1's copy number profile estimated by CopyVAE.c) Ground truth copy number profile of clone 2. d) Clone 2's copy number profile estimated by CopyVAE.e) Ground truth copy number profile of clone 3. f) Clone 3's copy number profile estimated by Copy-VAE.g) Ground truth copy number profile of clone 4. h) Clone 4's copy number profile estimated by CopyVAE.Loss curves of CopyVAE for the datasets BM2-v0s0, DCIS, and BCSA2.